Clustering exact matches of pairwise sequence alignments by weighted linear regression

Author: Alvaro J González
F Sanger
Li Liao
PA Pevzner
S Kurtz
SF Altschul
TF Smith
WJ Kent
WR Pearson
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when it is available. The interest is not to analyze a detailed alignment between a contig and the reference genome at the base level, but rather to have a rough estimate of where the contig aligns to the reference genome, specifically, by identifying the starting and ending positions of such a region. This information is very useful in ordering the contigs, facilitating post-assembly analysis such as gap closure and resolving repeats. There exist programs, such as BLAST and MUMmer, that can quickly align and identify high similarity segments between two sequences, which, when seen in a dot plot, tend to agglomerate along a diagonal but can also be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. It is a tedious and practically impossible task to visually inspect the dot plot to identify the regions covered by a large number of contigs from sequence assembly projects. A forced global alignment between a contig and the reference is not only time consuming but often meaningless. Results: We have developed an algorithm that uses the coordinates of all the exact matches or high similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest. Conclusion: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression for the alignment seeds. It was experimentally shown that the gain in execution time can be outstanding without compromising the accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
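The idea of fitting a weighted line through match coordinates can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout `(ref_start, contig_start, length)`, the use of match length as the regression weight, the tolerance `tol`, and the drop-the-worst-match trimming loop are all assumptions made for illustration, and the input is assumed to be non-empty.

```python
def weighted_fit(xs, ys, ws):
    """Weighted least-squares line y = slope * x + intercept."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx else 1.0  # degenerate case: assume the main diagonal
    return slope, my - slope * mx


def locate_contig(matches, tol=50.0):
    """matches: non-empty list of (ref_start, contig_start, length) tuples.

    Repeatedly fits a length-weighted line through the match start points and
    drops the single worst-fitting match until every residual is within tol.
    Returns (ref_begin, ref_end, slope, intercept) for the surviving cluster.
    """
    kept = list(matches)
    while True:
        xs = [m[0] for m in kept]
        ys = [m[1] for m in kept]
        ws = [m[2] for m in kept]
        slope, intercept = weighted_fit(xs, ys, ws)
        resid = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
        worst = max(range(len(kept)), key=resid.__getitem__)
        if resid[worst] <= tol or len(kept) <= 2:
            break
        kept.pop(worst)
    ref_begin = min(m[0] for m in kept)
    ref_end = max(m[0] + m[2] for m in kept)
    return ref_begin, ref_end, slope, intercept


# Four matches on one diagonal plus a short, distant spurious match.
matches = [(1000, 0, 200), (1250, 250, 300), (1600, 600, 150),
           (2000, 1000, 400), (9000, 300, 20)]
begin, end, slope, intercept = locate_contig(matches)
print(begin, end)  # 1000 2400
```

In this toy input the spurious match has the largest residual in the first fit and is discarded, after which the remaining matches lie on a line of slope 1 and the reported region spans reference positions 1000 to 2400.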

How accurately is ncRNA aligned within whole-genome multiple alignments?

Author: A Prakash
A Prakash
A Siepel
Adrienne X Wang
DA Pollard
DA Pollard
E Rivas
E Torarinsson
EH Margulies
G Bourque
J Pei
JD Thompson
JD Thompson
JD Thompson
L Wang
M Blanchette
M Brudno
M Cline
M Errami
Martin Tompa
MS Rosenberg
S Batzoglou
S Griffiths-Jones
S Griffiths-Jones
S Karlin
S Kumar
S Schwartz
S Washietl
SR Eddy
SR Eddy
T Lassmann
W Miller
Walter L Ruzzo
WJ Kent
WJ Kent
WJ Kent
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Multiple alignment of homologous DNA sequences is of great interest to biologists since it provides a window into evolutionary processes. At present, the accuracy of whole-genome multiple alignments, particularly in noncoding regions, has not been thoroughly evaluated. Results: We evaluate the alignment accuracy of certain noncoding regions using noncoding RNA alignments from Rfam as a reference. We inspect the MULTIZ 17-vertebrate alignment from the UCSC Genome Browser for all the human sequences in the Rfam seed alignments. In particular, we find 638 instances of chimeric and partial alignments to human noncoding RNA elements, of which at least 225 can be improved by straightforward means. As a byproduct of our procedure, we predict many novel instances of known ncRNA families that are suggested by the alignment. Conclusion: MULTIZ does a fairly accurate job of aligning these genomes in these difficult regions. However, our experiments indicate that better alignments exist in some regions.

arXiv.org e-Print Archive

Shape-based peak identification for ChIP-Seq

Author: A Barski
AA Bhinge
B Wold
EG Wilbanks
ET Wang
G Carlsson
G Robertson
GR Grimmett
J Rozowsky
Lior Pachter
M Lupien
MB Noyes
PJ Park
R Development Core Team
RK Bradley
S Bhamidi
S Evans
S MacArthur
S Pepke
SN Evans
Steven N Evans
T Barrett
T Laajala
Valerie Hower
WJ Kent
Y Benjamini
Y Benjamini
Y Zhang
Publication date: 05/05/2010

We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html
Comment: 12 pages, 6 figures
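The notion of persistence that the abstract cites as inspiration can be illustrated on a one-dimensional coverage profile. The Python sketch below is not T-PIC and does not compute its tree-shape statistics; it is the standard sweep from high to low coverage, in which each local maximum is "born" at its own height and "dies" when its region merges into one with a higher maximum. The function name and the convention of assigning the global maximum a persistence of `max - min` are assumptions for illustration.

```python
def peak_persistence(cov):
    """Return (peak_index, persistence) pairs, most persistent first."""
    order = sorted(range(len(cov)), key=lambda i: -cov[i])
    parent = {}  # union-find forest over positions already swept
    peak = {}    # component root -> index of that component's highest point
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        peak[i] = i
        for n in (i - 1, i + 1):
            if n not in parent:
                continue
            a, b = find(i), find(n)
            if a == b:
                continue
            if cov[peak[a]] < cov[peak[b]]:
                a, b = b, a
            # Component b has the lower peak, so it dies at the current level.
            persistence = cov[peak[b]] - cov[i]
            if persistence > 0:
                pairs.append((peak[b], persistence))
            parent[b] = a
    if order:
        top = peak[find(order[0])]
        pairs.append((top, cov[top] - min(cov)))  # convention for the global maximum
    return sorted(pairs, key=lambda p: -p[1])


coverage = [0, 2, 1, 5, 1, 3, 0]
print(peak_persistence(coverage))  # [(3, 5), (5, 2), (1, 1)]
```

Keeping only peaks above a persistence cutoff (for example 2 in the toy profile) separates the two prominent summits at positions 3 and 5 from the small bump at position 1 without assuming any parametric noise model.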

eScholarship - University of California

Caltech Authors

Developing and applying heterogeneous phylogenetic models with XRate

Author: A Heger
A Siepel
A Varadarajan
AJ Drummond
B Knudsen
B Knudsen
Christos A. Ouzounis
D Ayres
DB Searls
E Birney
G Lunter
GSC Slater
Ian Holmes
IM Meyer
J Felsenstein
J Goecks
J Watts
JS Pedersen
L Stein
M Garber
M Hasegawa
M Kimura
M Zuker
ME Skinner
N Saitou
O Penn
Oscar Westesson
PS Klosterman
RK Bradley
SR Eddy
TH Jukes
WJ Kent
Z Yang
Publication venue: Public Library of Science (PLoS)
Publication date: 16/02/2012

Modeling sequence evolution on phylogenetic trees is a useful technique in computational biology. Especially powerful are models which take account of the heterogeneous nature of sequence evolution according to the "grammar" of the encoded gene features. However, beyond a modest level of model complexity, manual coding of models becomes prohibitively labor-intensive. We demonstrate, via a set of case studies, the new built-in model-prototyping capabilities of XRate (macros and Scheme extensions). These features allow rapid implementation of phylogenetic models which would have previously been far more labor-intensive. XRate's new capabilities for lineage-specific models, ancestral sequence reconstruction, and improved annotation output are also discussed. XRate's flexible model-specification capabilities and computational efficiency make it well-suited to developing and prototyping phylogenetic grammar models. XRate is available as part of the DART software package: http://biowiki.org/DART
Comment: 34 pages, 3 figures, glossary of XRate model terminology

arXiv.org e-Print Archive

FigShare

Rapid pair-wise synteny analysis of large bacterial genomes using web-based GeneOrder4.0

Author: A Kaluszka
C Bru
Donald Seto
H Tettelin
J Tamames
MY Galperin
Padmanabhan Mahadevan
R Lavigne
R Lavigne
R Mazumder
R Overbeek
S Celamkoti
WJ Kent
Y Zheng
Publication venue: BioMed Central
Publication date: 01/01/2010

GIVE: portable genome browsers for personal websites.

Author: Alvin Zheng
B Sridhar
B Sridhar
C Tyner
D Barrios
D Comer
E Lieberman-Aiden
E Sharma
F Ozsolak
F Yue
FH Biase
JD Buenrostro
JG Aw
JT Robinson
LD Stein
ME Skinner
MJ Fullwood
Qiuyang Wu
R Bayer
R Li
R Mourad
S Carrere
Sheng Zhong
TC Nguyen
The ENCODE Project Consortium
VW Zhou
WJ Kent
X Li
X Zhou
Xiaoyi Cao
Z Lu
Zhangming Yan
Publication venue: eScholarship, University of California
Publication date: 01/07/2018

The growing popularity and diversity of genomic data demand portable and versatile genome browsers. Here, we present an open source programming library called GIVE that facilitates the creation of personalized genome browsers without requiring a system administrator. By inserting HTML tags, one can add to a personal webpage interactive visualization of multiple types of genomics data, including genome annotation, "linear" quantitative data, and genome interaction data. GIVE includes a graphical interface called HUG (HTML Universal Generator) that automatically generates HTML code for displaying user-chosen data, which can be copy-pasted into a user's personal website or saved and shared with collaborators. GIVE is available at: https://www.givengine.org/

eScholarship - University of California

Bovine Polledness – An Autosomal Dominant Trait with Allelic Heterogeneity

Author: A Capitan
A Capitan
Alexander Graf
B Graf
B McEvoy
C Charlier
C Drogemuller
CR Long
D Seichter
DC Koboldt
DL Stern
Doris Seichter
E Pailhoux
H Li
H Li
Helmut Blum
HM Wood
I Medugorac
Ingolf Russ
Ivica Medugorac
J Goecks
J Ramljak
K Lindblad-Toh
Karl Heinrich Göpel
KC Prayaga
L Grobet
LK Matukumalli
M Asai
M Blanchette
M Felius
M Georges
M Mariasegaram
Martin Förster
P Bialek
R Ihaka
RA Gibbs
RC Gentleman
RS Harris
SE Johnston
Shuhong Zhao
Sophie Rothammer
SR Browning
Stefan Krebs
W White
WF Dove
WJ Kent
WJ Kent
WJ Kent
Y Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2012

Persistent horns are an important trait of speciation for the family Bovidae, with complex morphogenesis taking place shortly after birth. Polledness is highly favourable in modern cattle breeding systems, but serious animal welfare issues call for a way to produce hornless cattle other than dehorning. Although the dominant inhibition of horn morphogenesis was discovered more than 70 years ago, and the causative mutation was mapped almost 20 years ago, its molecular nature remained unknown. Here, we report allelic heterogeneity of the POLLED locus. First, we mapped the POLLED locus to a ∼381-kb interval in a multi-breed case-control design. Targeted re-sequencing of an enlarged candidate interval (547 kb) in 16 sires with known POLLED genotype did not detect a common allele associated with polled status. In eight sires of Alpine and Scottish origin (four polled versus four horned), we identified a single candidate mutation, a complex 202 bp insertion-deletion event that showed perfect association with the polled phenotype in various European cattle breeds, except Holstein-Friesian. The analysis of the same candidate interval in eight Holsteins identified five candidate variants that segregate as a 260 kb haplotype also perfectly associated with the POLLED gene without recombination or interference with the 202 bp insertion-deletion. We further identified bulls that are progeny tested as homozygous polled but carry both the 202 bp insertion-deletion and the Friesian haplotype. The distribution of genotypes of the two putative POLLED alleles in a large semi-random sample (1,261 animals) supports the hypothesis of two independent mutations.

CiteSeerX

Public Library of Science (PLOS)

Multiple organism algorithm for finding ultraconserved elements

Author: A Sandelin
A Siepel
A Woolfe
AL Delcher
AL Delcher
B Ma
CF Cheung
D Gusfield
D Lawson
EA Glazov
EH Margulies
G Bejerano
Greg Madey
HW Mewes
JC Venter
JZ Ni
LD Stein
M Brudno
MI Abouelhoda
N Bray
Neil F Lobo
P Ferragina
RA Holt
S Kurtz
S Kurtz
S Schwartz
Scott Christley
SF Altschul
T Tran
TJP Hubbard
U Manber
WJ Kent
WJ Kent
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with micro RNAs, mRNA processing, development and transcription regulation. The identification and characterization of these elements among genomes is necessary for the further understanding of their functionality. Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one compared 17 vertebrate genomes where we find 123 ultraconserved elements longer than 40 bp shared by all of the organisms, and another compared the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences. Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design with a trade-off of disk space versus memory space allows for efficient computation while only requiring modest computer resources, and at the same time providing benefits not available with other software.
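The definition above (100% identity, no alignment required) can be made concrete with a small Python sketch that lists maximal exact substrings shared by every input sequence. This is a brute-force illustration for short strings, not the authors' disk-based algorithm: the function name, the k-mer pre-filter, and the restriction to the forward strand are assumptions made here.

```python
def shared_exact_elements(seqs, min_len):
    """Maximal substrings of seqs[0], at least min_len long, found in every sequence.

    Returns a list of (start_in_first_sequence, substring). Forward strand only.
    """
    ref, others = seqs[0], seqs[1:]

    def kmers(s):
        return {s[i:i + min_len] for i in range(len(s) - min_len + 1)}

    # Pre-filter: a shared element must start with a k-mer present everywhere.
    shared = kmers(ref)
    for s in others:
        shared &= kmers(s)

    found = []
    last_end = 0
    for i in range(len(ref) - min_len + 1):
        if ref[i:i + min_len] not in shared:
            continue
        j = i + min_len
        # Extend to the right while the longer string still occurs in all others.
        while j < len(ref) and all(ref[i:j + 1] in s for s in others):
            j += 1
        if j > last_end:  # skip elements contained in one already reported
            found.append((i, ref[i:j]))
            last_end = j
    return found


seqs = ["TTACGTACGGAT", "GGACGTACGGCC", "ACGTACGGTTTT"]
print(shared_exact_elements(seqs, 6))  # [(2, 'ACGTACGG')]
```

The substring tests make this quadratic or worse, so it only serves to pin down what "ultraconserved" means operationally; genome-scale inputs need the kind of indexing and disk/memory trade-off the abstract describes.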

Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression

Author: A Wagner
AE Kel
B Lenhard
B Shea
BP Berman
DA Papatsenko
DS Prestridge
DS Prestridge
F Larsen
GD Stormo
GG Loots
JA Warrington
JM Claverie
K Quandt
KD Pruitt
L Ponger
LL Hsiao
M Gardiner-Garden
MC Frith
MC Frith
MI Arnone
MS Halfon
N Rajewsky
O Johansson
R Ihaka
RR Sokal
S Aerts
S Hannenhalli
S Levy
S Levy
TD Schneider
V Matys
V Solovyev
W Krivan
WH Press
WJ Ewens
WJ Kent
WJ Kent
WW Wasserman
Y Suzuki
Y Suzuki
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. It is also unclear which TFs have TFBS concentrated in promoters and to what level this occurs. This study aims to answer some of these questions. RESULTS: We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group comprises 11%. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI. CONCLUSION: Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.
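Predicting sites with a PWM and then measuring their local concentration can be sketched in a few lines of Python. The toy four-column matrix, the uniform background of 0.25, the score threshold, and the sliding-window count are illustrative assumptions; in particular, `max_hits_in_window` is a simple stand-in and is not the cluster score measure developed in the study.

```python
import math


def log_odds(pwm, background=0.25):
    """Convert per-position base probabilities into log2 odds against a uniform background."""
    return [{base: math.log2(p / background) for base, p in col.items()} for col in pwm]


def scan(seq, lod, threshold):
    """Start positions where the summed log-odds score reaches the threshold."""
    width = len(lod)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(lod[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append(i)
    return hits


def max_hits_in_window(hits, window):
    """Largest number of predicted sites whose start positions span less than `window` bases."""
    best = 0
    lo = 0
    for hi in range(len(hits)):
        while hits[hi] - hits[lo] >= window:
            lo += 1
        best = max(best, hi - lo + 1)
    return best


# Hypothetical PWM favouring the motif TATA (each column sums to 1).
toy_pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
sequence = "GGTATATATAGGCCGCGCTATAGG"
hits = scan(sequence, log_odds(toy_pwm), threshold=5.0)
print(hits)                          # [2, 4, 6, 18]
print(max_hits_in_window(hits, 10))  # 3
```

With this threshold only exact TATA occurrences pass, and the three overlapping sites at the start of the sequence form the densest window, which is the kind of local concentration the abstract proposes as evidence for genuine TFBS.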